Abstract: Our goal in this paper is to plan the motion of a
robot in a partitioned environment with dynamically changing, 
locally sensed rewards. We assume that arbitrary assumptions 
on the reward dynamics can be given. The robot aims to 
accomplish a high-level temporal logic surveillance mission 
and to locally optimize the collection of the rewards in the 
visited regions. These two objectives often conflict and only a 
compromise between them can be reached. We address this 
issue by taking into consideration a user-defined preference 
function that captures the trade-off between the importance 
of collecting high rewards and the importance of making 
progress towards a surveyed region. Our solution leverages 
ideas from the automata-based approach to model checking. 
We demonstrate the utilization and benefits of the suggested 
framework in an illustrative example. 

I. Introduction 

In this paper, we consider the problem of robot path 
planning (see, e.g., [1] for an overview) with more complex 
missions than "Go from A to B while avoiding obstacles."
Recently, different versions of temporal logics, such as Linear
Temporal Logic (LTL), Computation Tree Logic (CTL),
or μ-calculus, have been successfully employed to specify
such robotic missions [2], [3], [4], [5], [6], [7], [8]. We
have chosen LTL [9], [10] as the specification means for its
resemblance to natural language and its ability to express
interesting robot behavior, such as "Repeatedly survey regions
A and B while avoiding dangerous regions. Make sure that
A is always visited in between two successive visits to B
and vice versa."

We assume that the robot motion in the environment 
is modeled as a transition system, which is obtained by 
partitioning the environment into regions (for instance using 
well-known triangulations and rectangular partitions). Each 
region is modeled as a state of the transition system and the 
robot's capability to move between the regions as transitions 
between the corresponding states. Our transition system is 
deterministic, i.e., a control input for the robot is the next 
region (state) to be visited. Moreover, the transition system 
is weighted, i.e., each transition is equipped with the time 
duration this transition takes. 

The robot's task is to collect rewards that dynamically 
appear, disappear and change their values in the environment 
regions and that can be sensed only within a certain vicinity 
of the robot's current state. A traditional approach to this 
kind of problem, i.e., an optimization problem defined on 
a dynamically changing plant, is model predictive control
(MPC) [11]. The method is based on iterative re-planning 
and optimization of a cost function over a finite horizon and 
hence, it is also called receding horizon control. 

In this work, we focus on interconnecting the receding 
horizon control with the synthesis of a path that is provably 
correct with respect to a given temporal logic formula. This 
idea appeared in [12], [13], where the receding horizon 
approach was employed to fight the high computational 
complexity of reactive motion planning with a specification 
in the GR(1) fragment of LTL. However, the authors did not
consider any reward collection to be optimized. In contrast,
the authors in [14] addressed a problem similar to ours.
They assumed a deterministic weighted transition system
with locally sensed rewards changing according to unknown
dynamics. While they required the satisfaction of an
LTL mission, they also aimed to collect maximal rewards
locally, within a given horizon. These two goals often cannot
be reached simultaneously. If the robot primarily collects
high rewards, the mission might never be satisfied, and vice
versa, if the robot is planned to accomplish the mission, the
collected rewards might become low. The authors utilized
ideas from the automata-based approach to model checking 
in order to iteratively find a local path maximizing the 
collected rewards among the local paths that ensure that a 
step towards the mission satisfaction is made. This way, they 
managed to compromise between the two goals. 

Our work can be seen as a different, generalized approach 
to the above problem. We allow the trade-off between the 
two goals to be partially driven by user-defined preferences 
that may dynamically change during the execution of the 
robot. In particular, we assume an LTL mission that includes 
surveillance of a set of regions and a user-defined preference 
function expressing the desired trade-off between the surveil- 
lance and the reward collection given the history of the robot 
motion. In other words, the preference function determines in
each moment whether moving towards a surveyed place or 
optimization of the collected rewards is of a higher priority. 
Whereas the local path planned in [14] always guarantees 
progress towards the satisfaction of the mission, in our case 
this progress may be deliberately postponed (for a finite 
amount of time) if the collection of the rewards is prioritized. 
For example, consider a garbage truck that is required to
periodically visit two garbage disposal plants A and B and
to arrive at a plant as fully loaded as possible. In [14],
each local plan for the truck would send the truck closer
to A (or B, respectively) and the truck might arrive
half-empty. In contrast, through the preference function, we can
define that collecting the garbage is the primary target until
the truck is full enough to drive to a plant and that once
it is, driving towards A (or B, respectively) becomes the
priority. Besides that, we generalize the problem from [14]
in the following sense. The authors there assumed that the 
reward dynamics is completely unknown. Therefore, when 
planning, they estimate that the rewards collected along a 
planned local path would be equal to the sum of the rewards 
that are currently seen at the states of this path and they 
aim to maximize it. We consider that arbitrary assumptions 
on the reward dynamics might be given and we estimate the 
rewards collected along a planned local path accordingly. We 
also allow for a broader class of optimization functions. 

In our solution, we leverage ideas from the automata-based 
approach to model-checking to provably guarantee the sat- 
isfaction of the mission and we introduce several extensions 
that allow us to support both the preference function and 
the arbitrary assumptions on the reward dynamics. We build 
a so-called product automaton that captures all the runs of 
the transition system that satisfy the mission. We employ 
the preference function to compute the attraction of states 
in the product automaton and at each time, we choose the 
most attractive state to be visited next. While the value of 
the preference function is low, the robot is primarily driven 
by the sensed rewards. However, as the preference function 
grows, the surveillance is prioritized and the attraction forces 
the robot to move not only towards the surveyed regions, but 
also towards accepting states of the product automaton, i.e., 
towards the satisfaction of the global specification. 

Our contribution can be summarized as follows. We de- 
velop a general framework for robot motion planning with 
high-level LTL mission specifications and locally optimal 
reward collection with respect to given reward dynamics 
assumptions and local reward sensing. We introduce a
novel approach that allows the user to prescribe whether the
reward collection or the mission progress is of higher interest. We
present several illustrative examples and simulation results to
demonstrate the usability of our approach.

The rest of the paper is organized as follows. In Section II
we review necessary preliminaries. In Section III the problem
is described in detail and stated formally. In Section IV
we present the solution, correctness and completeness proofs,
and discussions on the solution optimality. In Section V a
case study is introduced, and we conclude in Section VI.

II. Notation and Preliminaries 



In this section we introduce notation and preliminaries 
used throughout the paper. 

Given a set $S$, we denote by $S^+$ and $S^\omega$ the sets of all finite
nonempty and all infinite sequences of elements from $S$, respectively.

Definition 1 (Weighted Deterministic Transition System):
A weighted deterministic transition system (TS) is a tuple
$\mathcal{T} = (Q, q_0, T, \Pi, L, W)$, where

• $Q$ is a finite set of states;

• $q_0 \in Q$ is an initial state;

• $T \subseteq Q \times Q$ is a transition relation;

• $\Pi$ is a set of atomic propositions;

• $L : Q \to 2^{\Pi}$ is a labeling function; and

• $W : T \to \mathbb{R}_{>0}$ is a weight function.

The states of the transition system represent the regions of 
the environment and the transitions represent the robot's 
capabilities to move between them. We assume that there 
is a transition from each state. The atomic propositions are 
properties that are either true or false in each region of the 
environment, for instance "This region is a pickup/delivery
location." The labeling function $L$ assigns to each state the
set of atomic propositions that hold true in this state. The
weight function assigns to each transition the amount of time
that this transition takes. If the robot is in a state $q$ at time
$t$ and follows a transition $(q, q') \in T$, then it is in the state
$q'$ at time $t + W((q, q'))$. The time spent in states is 0.
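
The timing convention above can be illustrated with a minimal Python sketch. The class name `TransitionSystem`, its fields, and the three-region example are illustrative assumptions and not part of the original formulation; the sketch only shows how the time instances $t_0, t_1, \ldots$ accumulate along a run prefix when no time is spent in the states themselves.

```python
from dataclasses import dataclass


@dataclass
class TransitionSystem:
    """Hypothetical container for a weighted deterministic TS (Def. 1)."""
    init: str
    weights: dict  # (q, q') -> positive duration; the keys form the relation T
    labels: dict   # q -> set of atomic propositions that hold in q

    def arrival_times(self, run_prefix):
        """Return [t_0, ..., t_n] for a run prefix q_0 ... q_n."""
        if run_prefix[0] != self.init:
            raise ValueError("a run prefix must start in the initial state")
        times = [0.0]
        for q, q_next in zip(run_prefix, run_prefix[1:]):
            if (q, q_next) not in self.weights:
                raise ValueError(f"no transition {q} -> {q_next}")
            # time only passes on transitions, never inside a state
            times.append(times[-1] + self.weights[(q, q_next)])
        return times

    def word(self, run_prefix):
        """Sequence of label sets generated along the run prefix."""
        return [self.labels[q] for q in run_prefix]


# Assumed toy environment with three regions.
ts = TransitionSystem(
    init="a",
    weights={("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "a"): 3.0, ("b", "a"): 2.0},
    labels={"a": {"sur"}, "b": set(), "c": {"pickup"}},
)
times = ts.arrival_times(["a", "b", "c", "a"])
```

Representing the transition relation as the key set of the weight dictionary keeps $T$ and $W$ consistent by construction, which is a convenience of this sketch rather than a requirement of the definition.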

A run of $\mathcal{T}$ is an infinite sequence $\rho = q_0 q_1 \ldots$ such
that $q_0$ is the initial state and $(q_i, q_{i+1}) \in T$, for all $i \geq 0$.
A finite run $\rho_{\mathrm{fin}} = q_i \ldots q_j$ of $\mathcal{T}$ is a finite subsequence
of a run $\rho = q_0 \ldots q_i \ldots q_j \ldots$ of $\mathcal{T}$. A run prefix $\rho_{\mathrm{pfix}}$ of
$\mathcal{T}$ is a finite run that originates at the initial state $q_0$. For
simplicity, we denote by $q \in \rho$ ($q \in \rho_{\mathrm{fin}}$) the fact that the
state $q$ occurs in the run $\rho$ (the finite run $\rho_{\mathrm{fin}}$). Associated
with a run $\rho = q_0 q_1 \ldots$ (and a run prefix $\rho_{\mathrm{pfix}} = q_0 \ldots q_n$)
there is a sequence of time instances $t_0 t_1 \ldots$ (and $t_0 \ldots t_n$),
where $t_0 = 0$, and $t_i$ denotes the time at which the state $q_i$
is reached ($t_{i+1} = t_i + W((q_i, q_{i+1}))$). A run $\rho = q_0 q_1 \ldots$
generates a unique word $w(\rho) = L(q_0) L(q_1) \ldots$. A control
strategy $C : Q^+ \to Q$ for $\mathcal{T}$ assigns the next state to be
visited to each run prefix of $\mathcal{T}$. The run generated by $C$ is
$\rho = q_0 q_1 \ldots$, such that $q_i = C(q_0 \ldots q_{i-1})$, for all $i \geq 1$.

With a slight abuse of notation, we use $W(\rho_{\mathrm{fin}})$, where
$\rho_{\mathrm{fin}} = q_i \ldots q_j$ is a finite run of $\mathcal{T}$, to denote the total weight
of the sequence of transitions $(q_i, q_{i+1}), \ldots, (q_{j-1}, q_j)$, i.e.,
$W(\rho_{\mathrm{fin}}) = \sum_{k=i}^{j-1} W((q_k, q_{k+1}))$. Furthermore, we define
$W^*(q_i, q_j)$ as the minimum weight of a finite run from $q_i$
to $q_j$. In particular, $W^*(q_i, q_i) = 0$, and $W^*(q_i, q_j) = \infty$ if
there does not exist a finite run from $q_i$ to $q_j$.
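
Since all weights are positive, the minimum weight $W^*$ can be obtained with any standard shortest-path method. The following sketch uses Dijkstra's algorithm; the function names `min_weights` and `w_star` and the example weights are assumptions made here for illustration, and the original text does not prescribe a particular algorithm.

```python
import heapq
import math


def min_weights(weights, source):
    """Dijkstra: map every state q reachable from source to W*(source, q)."""
    succ = {}
    for (q, q_next), w in weights.items():
        succ.setdefault(q, []).append((q_next, w))
    dist = {source: 0.0}  # W*(q, q) = 0 by definition
    heap = [(0.0, source)]
    while heap:
        d, q = heapq.heappop(heap)
        if d > dist.get(q, math.inf):
            continue  # stale queue entry, a shorter run was already found
        for q_next, w in succ.get(q, []):
            if d + w < dist.get(q_next, math.inf):
                dist[q_next] = d + w
                heapq.heappush(heap, (d + w, q_next))
    return dist


def w_star(weights, q_i, q_j):
    """W*(q_i, q_j); infinity if no finite run from q_i to q_j exists."""
    return min_weights(weights, q_i).get(q_j, math.inf)


# Assumed example: the direct transition a -> c is slower than going via b.
weights = {("a", "b"): 2.0, ("b", "c"): 1.0, ("a", "c"): 5.0, ("c", "c"): 1.0}
```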

Definition 2 (Linear Temporal Logic): A linear temporal
logic (LTL) formula $\phi$ over the set of atomic propositions $\Pi$
is defined according to the following rules:

$$\phi ::= \top \mid \pi \mid \neg\phi \mid \phi \lor \phi \mid \phi \land \phi \mid \mathsf{X}\,\phi \mid \phi\,\mathsf{U}\,\phi \mid \mathsf{G}\,\phi \mid \mathsf{F}\,\phi,$$

where $\top$ is always true, $\pi \in \Pi$ is an atomic proposition,
$\neg$ (negation), $\lor$ (disjunction) and $\land$ (conjunction) are
standard Boolean connectives, and $\mathsf{X}$ (next), $\mathsf{U}$ (until), $\mathsf{G}$
(always) and $\mathsf{F}$ (eventually) are temporal operators.
The semantics of LTL is defined over infinite sequences
over $2^{\Pi}$, such as those generated by the transition system $\mathcal{T}$
from Def. 1. Assume that $\phi$, $\phi_1$, and $\phi_2$ are LTL formulas
over $\Pi$ and $w = w(0) w(1) \ldots \in (2^{\Pi})^\omega$ is a word generated
by a run $\rho$ of $\mathcal{T}$. The word $w$ satisfies an atomic proposition
$\pi$ if $\pi$ holds in the first position of $w$, i.e., if $\pi \in w(0)$.
The formula $\mathsf{X}\,\phi$ states that $\phi$ needs to hold next, i.e., for
the word $w(1) \ldots$. The formula $\phi_1\,\mathsf{U}\,\phi_2$ means that $\phi_2$ is
true eventually, while $\phi_1$ is true at least until $\phi_2$ becomes
true. Formulas $\mathsf{G}\,\phi$ and $\mathsf{F}\,\phi$ state that $\phi$ holds always and
eventually, respectively. More expressiveness can be achieved
by combining the operators. A detailed description of LTL
can be found in [10]. As expected, a run $\rho$ of $\mathcal{T}$ satisfies $\phi$
if and only if the word $w(\rho)$ generated by $\rho$ satisfies $\phi$.

Definition 3 (Büchi Automaton): A Büchi automaton
(BA) is a tuple $\mathcal{B} = (S, s_0, \Sigma, \delta, F)$, where

• $S$ is a finite set of states;

• $s_0 \in S$ is an initial state;

• $\Sigma$ is an input alphabet;

• $\delta \subseteq S \times \Sigma \times S$ is a transition relation; and

• $F \subseteq S$ is a set of accepting states.

The semantics of a Büchi automaton is defined over infinite
input words. Note that if $\Sigma = 2^{\Pi}$, then the input words
are infinite sequences of sets of atomic propositions, such
as those generated by $\mathcal{T}$. A run of $\mathcal{B}$ over an input word
$\sigma = \sigma_0 \sigma_1 \ldots \in \Sigma^\omega$ is a sequence of states $\varrho = s_0 s_1 \ldots$ such
that $s_0$ is the initial state and $(s_i, \sigma_i, s_{i+1}) \in \delta$, for all $i \geq 0$.
A run $\varrho$ is accepting if and only if a state from $F$ appears in
$\varrho$ infinitely many times. A word $\sigma$ is accepted by the Büchi
automaton if there exists an accepting run over $\sigma$.

For any LTL formula $\phi$ over $\Pi$, there exists a Büchi
automaton $\mathcal{B}_\phi$ with input alphabet $2^{\Pi}$ accepting all and only
the words satisfying formula $\phi$. Algorithms for translation of
an LTL formula into a corresponding Büchi automaton have been
proposed [15], and several tools are available [16].
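
As a small illustration of the acceptance condition, the sketch below checks an ultimately periodic ("lasso") run, i.e., a finite prefix followed by a cycle that repeats forever. Such a run visits an accepting state infinitely often exactly when the cycle contains one. The function name, the lasso encoding, and the two-state automaton for "infinitely often sur" are assumptions made here for illustration; they are not taken from the original text or from the tools in [16].

```python
def is_accepting_lasso(delta, s0, accepting, prefix, cycle):
    """Check a lasso run given as lists of (state, letter) pairs.

    The run is prefix followed by cycle repeated forever; the letter paired
    with a state is the input symbol read when leaving that state.
    """
    steps = prefix + cycle
    if not cycle or steps[0][0] != s0:
        return False
    # successor of every step; the last step wraps to the start of the cycle
    targets = [s for s, _ in steps[1:]] + [cycle[0][0]]
    valid = all((s, a, t) in delta for (s, a), t in zip(steps, targets))
    # a state repeats infinitely often iff it lies on the cycle
    return valid and any(s in accepting for s, _ in cycle)


NONE, SUR = frozenset(), frozenset({"sur"})
# Assumed automaton for "infinitely often sur": state 1 is entered on sur.
delta = {(0, NONE, 0), (0, SUR, 1), (1, SUR, 1), (1, NONE, 0)}
good = is_accepting_lasso(delta, 0, {1}, [(0, NONE)], [(0, SUR), (1, NONE)])
bad = is_accepting_lasso(delta, 0, {1}, [], [(0, NONE)])
```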

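Definition 4 (Weighted Product Automaton): A weighted
product automaton between a TS $\mathcal{T} = (Q, q_0, T, \Pi, L, W)$
and a BA $\mathcal{B}_\phi = (S, s_0, 2^{\Pi}, \delta, F)$ is a tuple
$\mathcal{P} = \mathcal{T} \times \mathcal{B}_\phi = (S_{\mathcal{P}}, s_{\mathcal{P}0}, \delta_{\mathcal{P}}, F_{\mathcal{P}}, W_{\mathcal{P}})$, where

• $S_{\mathcal{P}} = Q \times S$ is a set of states;

• $s_{\mathcal{P}0} = (q_0, s_0)$ is the initial state;

• $\delta_{\mathcal{P}} \subseteq S_{\mathcal{P}} \times S_{\mathcal{P}}$ is a transition relation, where
$((q, s), (q', s')) \in \delta_{\mathcal{P}}$ if and only if $(q, q') \in T$ and
$(s, L(q), s') \in \delta$;

• $F_{\mathcal{P}} = Q \times F$ is the set of accepting states; and

• $W_{\mathcal{P}} : \delta_{\mathcal{P}} \to \mathbb{R}_{>0}$ is a weight function,
where $W_{\mathcal{P}}(((q, s), (q', s'))) = W((q, q'))$ for all
$((q, s), (q', s')) \in \delta_{\mathcal{P}}$.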

Note that the product automaton defined above is a
weighted version of a standard Büchi automaton with a
trivial alphabet that is thus omitted. We denote by $\alpha(\varrho_{\mathcal{P}})$
the projection of a run $\varrho_{\mathcal{P}}$ of $\mathcal{P}$ onto its first components,
i.e., $\alpha((q_0, s_0)(q_1, s_1) \ldots) = q_0 q_1 \ldots$. An accepting run $\varrho_{\mathcal{P}}$
of the product automaton $\mathcal{P}$ can be projected onto a run
$\alpha(\varrho_{\mathcal{P}})$ of $\mathcal{T}$ that satisfies the formula $\phi$, and vice versa, if
$\rho = q_0 q_1 \ldots$ is a run of $\mathcal{T}$ satisfying $\phi$, then there exists an
accepting run $\varrho_{\mathcal{P}} = (q_0, s_0)(q_1, s_1) \ldots$ of $\mathcal{P}$.

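The product automaton can also be viewed as a transition
system $\mathcal{T}_{\mathcal{P}} = (S_{\mathcal{P}}, s_{\mathcal{P}0}, \delta_{\mathcal{P}}, \Pi, L_{\mathcal{P}}, W_{\mathcal{P}})$, where
$L_{\mathcal{P}}((q, s)) = L(q)$, for all $(q, s) \in S_{\mathcal{P}}$. Hence, the objects
that are defined on a transition system are defined on the
product automaton in the expected way. Namely, we use
$W_{\mathcal{P}}(\varrho_{\mathcal{P}\mathrm{fin}})$, $W^*_{\mathcal{P}}(p_i, p_j)$ and $C_{\mathcal{P}}(\varrho_{\mathcal{P}\mathrm{pfix}})$ to denote the total
weight of a finite run $\varrho_{\mathcal{P}\mathrm{fin}}$, the minimum weight between
states $p_i$, $p_j$ and the control strategy for $\mathcal{P}$, respectively.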
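
The product construction of Def. 4 can be sketched in a few lines of Python. The function `build_product`, the dictionary-based encodings, and the two-region example with a two-state automaton for "infinitely often sur" are assumptions made here for illustration. The sketch only materializes product states that occur on some transition (plus the initial state), whereas the definition takes the full set $Q \times S$.

```python
def build_product(weights, labels, q0, ba_delta, s0, ba_accepting):
    """Return (initial state, weighted transitions, accepting states)."""
    delta_p = {}
    for (q, q_next), w in weights.items():
        for (s, letter, s_next) in ba_delta:
            # the automaton reads the label of the state being left
            if letter == labels[q]:
                delta_p[((q, s), (q_next, s_next))] = w  # weight inherited from T
    states = {p for edge in delta_p for p in edge} | {(q0, s0)}
    accepting = {(q, s) for (q, s) in states if s in ba_accepting}
    return (q0, s0), delta_p, accepting


NONE, SUR = frozenset(), frozenset({"sur"})
# Assumed toy inputs: region "a" is surveyed, region "b" is not.
weights = {("a", "b"): 1.0, ("b", "a"): 2.0, ("b", "b"): 1.0}
labels = {"a": SUR, "b": NONE}
ba_delta = {(0, NONE, 0), (0, SUR, 1), (1, SUR, 1), (1, NONE, 0)}
init, delta_p, accepting = build_product(weights, labels, "a", ba_delta, 0, {1})
```

Labels are stored as frozensets so that they can be compared directly with the letters of the automaton alphabet $2^{\Pi}$.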

III. Problem Formulation 

Consider a robot moving in a partitioned environment
modeled as a weighted deterministic transition system. The
states of the transition system correspond to individual
regions of the environment and the transitions between them
model the robot motion capabilities. Assume that there
is a dynamically changing non-negative real-valued reward
associated with each state of the transition system. The robot
senses the rewards in its close proximity and collects the
rewards as it visits the regions of the environment, i.e., as the
states of the transition system change. Moreover, the robot
is given a high-level LTL mission. The problem addressed
in previous literature [14] is to design a control strategy that
(1) guarantees the satisfaction of the mission and (2) locally
maximizes the collected rewards.

We focus on a different version of the above problem
allowing for partial regulation of the trade-off between the
two objectives. In particular, first, we consider a user-defined
preference function that, given a history of the robot's
movement, expresses whether moving closer to a region under
surveillance or collecting rewards is prioritized. Second, we
consider arbitrary reward dynamics that might be unknown,
known partially or even fully. We capture the concrete reward
dynamics assumptions through a so-called state potential
function. The problem we address is to design a control
strategy that (1) guarantees the satisfaction of the mission,
(2) locally optimizes the collection of rewards, and (3) takes
into consideration the preference function and the reward
dynamics assumptions.

We formalize the problem as follows. The robot motion in
the environment is given as a TS $\mathcal{T} = (Q, q_0, T, \Pi, L, W)$
(Def. 1). The rewards can be sensed at time $t_k$ within the
visibility range $v \in \mathbb{R}_{>0}$ from the robot's current position $q_k$.
We denote by $V(q_k) = \{q \mid W^*(q_k, q) \leq v\}$ the set of states
that are within the visibility range $v$ from $q_k$ (assuming that
$q \in V(q_k)$, for all $(q_k, q) \in T$) and by
$R : Q \times Q^+ \to \mathbb{R}_{\geq 0}$ the reward function, where
$R(q, q_0 \ldots q_k)$ is the reward sensed in the state $q$ at time $t_k$
after executing the run prefix $q_0 \ldots q_k$. Note that
$R(q, q_0 \ldots q_k)$ is defined iff $q \in V(q_k)$
and it is known only at time $t_k$ (and later), not earlier.

A user-defined planning horizon and a state potential 
function are employed to capture user's assumptions about 
the reward dynamics and her interests. For instance, the 
values of the rewards may increase or decrease at most 
by 1 during 1 time unit, they may appear according to a 
probabilistic distribution, or their changes might be random. 
The user might have full, partial or no knowledge of the 
reward dynamics. The rewards might disappear once they 
are collected by the robot, or they might not. The user might 
be interested in the maximal, expected, or minimal sum of 
rewards that can be collected from a given state during 
a finite run whose weight is no more than the planning 
horizon. The concrete definitions of the planning horizon 
and the state potential function are meant to be specifically 
tailored for different cases. Formally, the horizon is
$h \in \mathbb{R}_{>0}$, $h \geq \max_{(q, q') \in T} W((q, q'))$, and the state potential
function is $\mathrm{pot} : Q \times Q^+ \times \mathbb{R}_{>0} \to \mathbb{R}_{\geq 0}$,
where $\mathrm{pot}(q, q_0 \ldots q_k, h)$ is the potential of the state $q$ at
time $t_k$. More precisely, the value of $\mathrm{pot}(q, q_0 \ldots q_k, h)$ is
defined for all $q$, where $(q_k, q) \in T$, and captures the rewards
that can be collected after execution of the run prefix
$q_0 \ldots q_k$ during a finite run
$\rho_{\mathrm{fin}} \in P_{\mathrm{fin}}(q, q_k, h)$, where

$P_{\mathrm{fin}}(q, q_k, h) = \{\rho_{\mathrm{fin}} \mid \rho_{\mathrm{fin}}$ is a finite run of $\mathcal{T}$, such that

(i) $\rho_{\mathrm{fin}}$ originates at $q$;

(ii) $W(\rho_{\mathrm{fin}}) + W((q_k, q)) \leq h$; and

(iii) the states that appear in $\rho_{\mathrm{fin}}$ belong to $V(q_k)\}$.

Note that the visibility range $v$ and the planning horizon
$h$ are independent. Whilst $v$ determines the set of states
whose rewards are visible from the current state $q_k$, $h$ gives
the maximal total weight of a planned finite run $\rho_{\mathrm{fin}}$ within
$V(q_k)$, which can be even greater than $v$.

Example 1: The function stating that the potential of $q$
is the maximal sum of rewards that can be collected from
$q$ assuming that the rewards do not change while the robot
can sense them and that they disappear once collected is

$$\mathrm{pot}(q, q_0 \ldots q_k, h) = \max_{\rho_{\mathrm{fin}} \in P_{\mathrm{fin}}(q, q_k, h)} \sum_{q' \in \rho_{\mathrm{fin}}} R(q', q_0 \ldots q_k).$$

In fact, this is how the authors in [14] estimate the amount of
rewards collected on a local path.
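
A brute-force sketch of the potential from Example 1 follows. It enumerates the finite runs that start at the candidate state $q$, stay inside $V(q_k)$ and respect the horizon, and counts the reward of each state only once (rewards disappear once collected). The function names, the argument layout, and the four-region example are assumptions made here for illustration; the enumeration is exponential in general and is meant only to make the definition concrete.

```python
import heapq
import math


def visible(weights, q_k, v):
    """V(q_k): states whose minimum weight from q_k is at most v (Dijkstra)."""
    succ = {}
    for (a, b), w in weights.items():
        succ.setdefault(a, []).append((b, w))
    dist, heap = {q_k: 0.0}, [(0.0, q_k)]
    while heap:
        d, q = heapq.heappop(heap)
        if d > dist.get(q, math.inf):
            continue
        for nxt, w in succ.get(q, []):
            if d + w < dist.get(nxt, math.inf):
                dist[nxt] = d + w
                heapq.heappush(heap, (d + w, nxt))
    return {q for q, d in dist.items() if d <= v}


def potential(weights, rewards, q_k, q, h, v):
    """Best sum of currently sensed rewards over finite runs starting at q."""
    vis = visible(weights, q_k, v)
    succ = {}
    for (a, b), w in weights.items():
        succ.setdefault(a, []).append((b, w))
    best = 0.0

    def explore(state, budget, seen, total):
        nonlocal best
        if state not in seen:  # a reward is collected at most once
            total += rewards.get(state, 0.0)
            seen = seen | {state}
        best = max(best, total)
        for nxt, w in succ.get(state, []):
            if nxt in vis and w <= budget:  # positive weights ensure termination
                explore(nxt, budget - w, seen, total)

    # the transition (q_k, q) itself already consumes part of the horizon
    explore(q, h - weights[(q_k, q)], frozenset(), 0.0)
    return best


weights = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "b"): 1.0,
           ("b", "a"): 1.0, ("a", "d"): 1.0, ("d", "a"): 1.0}
rewards = {"a": 0.0, "b": 2.0, "c": 5.0, "d": 4.0}  # values sensed at time t_k
pot_b = potential(weights, rewards, "a", "b", h=3.0, v=2.0)
pot_d = potential(weights, rewards, "a", "d", h=3.0, v=2.0)
```

In this assumed example the robot at `a` sees all four regions, and moving to `b` is more promising because the run `b c` fits into the remaining horizon.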

To define our problem, we assume that there is a set
of regions labeled with a so-called surveillance proposition
$\pi_{\mathrm{sur}} \in \Pi$ and a part of the mission is to periodically fulfill
the surveillance proposition by visiting one of those regions.
The missions are then expressed as LTL formulas of the form

$$\phi = \varphi \land \mathsf{G}\,\mathsf{F}\,\pi_{\mathrm{sur}}, \tag{1}$$

where $\varphi$ is an arbitrary LTL formula over $\Pi$. The subformula
$\mathsf{G}\,\mathsf{F}\,\pi_{\mathrm{sur}}$ states that the surveillance proposition $\pi_{\mathrm{sur}}$ has to
be visited always eventually, i.e., infinitely many times. Note
that a formula $\phi = \varphi \land \mathsf{G}\,\mathsf{F}\,\top$ holds true if and only if $\varphi$ holds
true, and therefore the prescribed form does not restrict the
full LTL expressivity.

The user can partially guide whether the robot should
collect high rewards or whether it should rather make a step
towards the satisfaction of the surveillance proposition $\pi_{\mathrm{sur}}$
through a preference function. For example, the preference
function can grow linearly with time since the latest visit
to $\pi_{\mathrm{sur}}$, meaning that going towards $\pi_{\mathrm{sur}}$ gradually gains
more importance. In contrast, the value of the preference
function can stay low as long as the latest visit to $\pi_{\mathrm{sur}}$ happened
no more than 100 time units ago and increase rapidly after that,
expressing that the robot is preferred to collect rewards for
100 time units and then to move towards $\pi_{\mathrm{sur}}$ quickly.

Formally, the preference function $\mathrm{pref} : Q^+ \to \mathbb{R}_{\geq 0}$
assigns a non-negative real value to each executed run prefix
$q_0 \ldots q_k$ of $\mathcal{T}$ (possibly) taking into account the current
values of the state potential function.

Example 2: An example of a preference function is

$$\mathrm{pref}(q_0 \ldots q_k) = 0.01 \cdot W_i \cdot \max_{q : (q_k, q) \in T} \mathrm{pot}(q, q_0 \ldots q_k, h),$$

where $W_i = W(q_i \ldots q_k)$, such that $\pi_{\mathrm{sur}} \in L(q_i)$, and
$\pi_{\mathrm{sur}} \notin L(q_j)$, for all $i < j \leq k$. If the surveyed state is
being avoided, the total weight $W_i$ since the last visit to a
surveyed state gradually grows and eventually, the value of
$\mathrm{pref}(q_0 \ldots q_k)$ overgrows the value of $\mathrm{pot}(q, q_0 \ldots q_k, h)$ for
all $q$.
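
The preference function of Example 2 can be sketched as follows. The helper names, the `scale` parameter (set to the constant 0.01 from the example), and the convention of counting from the start of the run when no surveyed state has been visited yet are assumptions made here for illustration; the potentials of the successor states are passed in as precomputed values.

```python
import math


def weight_since_last_visit(prefix, weights, labels, prop="sur"):
    """W_i: weight travelled since the most recent state labeled with prop."""
    total = 0.0
    for idx in range(len(prefix) - 1, 0, -1):
        if prop in labels[prefix[idx]]:
            return total
        total += weights[(prefix[idx - 1], prefix[idx])]
    # reached q_0: assumed to act as the reference point of the whole run
    return total


def pref(prefix, weights, labels, potentials, scale=0.01):
    """Example 2: scale * W_i * (largest potential among successor states)."""
    w_i = weight_since_last_visit(prefix, weights, labels)
    return scale * w_i * max(potentials.values())


weights = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "b"): 1.0, ("b", "a"): 2.0}
labels = {"a": {"sur"}, "b": set(), "c": set()}
potentials = {"a": 3.0, "c": 7.0}  # assumed potentials of the successors of b
value = pref(["a", "b", "c", "b"], weights, labels, potentials)
```

Because the value is proportional to $W_i$, it keeps growing for as long as the surveyed states are avoided, which is the behavior the example relies on.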

A shortening indicator function $I$ indicates whether a
transition leads the robot closer to a state subject to surveillance.
$I : T \to \{0, 1\}$ is defined as follows:

$$I((q, q')) = \begin{cases} 1 & \text{if } \min_{q_\pi \in Q_\pi} W^*(q', q_\pi) < \min_{q_\pi \in Q_\pi} W^*(q, q_\pi), \\ 0 & \text{otherwise,} \end{cases}$$

where $(q, q') \in T$ and $Q_\pi = \{q_\pi \mid \pi_{\mathrm{sur}} \in L(q_\pi)\}$.
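
One possible way to compute the shortening indicator for all transitions at once is to run a single multi-source Dijkstra search on the reversed graph, starting from every surveyed state. The sketch below follows this idea; the function names and the three-region example are assumptions made here for illustration and are not prescribed by the original text.

```python
import heapq
import math


def dist_to_surveyed(weights, labels, prop="sur"):
    """For each state q, the minimum of W*(q, q_pi) over surveyed states q_pi."""
    pred = {}
    for (q, q_next), w in weights.items():
        pred.setdefault(q_next, []).append((q, w))
    dist = {q: 0.0 for q, props in labels.items() if prop in props}
    heap = [(0.0, q) for q in dist]
    heapq.heapify(heap)
    while heap:
        d, q = heapq.heappop(heap)
        if d > dist.get(q, math.inf):
            continue  # stale queue entry
        for q_prev, w in pred.get(q, []):
            if d + w < dist.get(q_prev, math.inf):
                dist[q_prev] = d + w
                heapq.heappush(heap, (d + w, q_prev))
    return dist


def shortening_indicator(weights, labels):
    """I((q, q')) = 1 iff q' is strictly closer to a surveyed state than q."""
    dist = dist_to_surveyed(weights, labels)
    return {(q, q2): int(dist.get(q2, math.inf) < dist.get(q, math.inf))
            for (q, q2) in weights}


weights = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "a"): 1.0, ("b", "a"): 3.0}
labels = {"a": {"sur"}, "b": set(), "c": set()}
indicator = shortening_indicator(weights, labels)
```

Note that a transition leaving a surveyed state never has indicator 1 under this definition, because the distance of a surveyed state to itself is 0.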

We are now ready to formally state our problem. 

Problem Formulation 1: Given the robot motion model
$\mathcal{T} = (Q, q_0, T, \Pi, L, W)$; the surveillance proposition
$\pi_{\mathrm{sur}} \in \Pi$; the visibility range $v$; the reward $R(q, q_0 \ldots q_k)$ at time $t_k$,
for all $q \in V(q_k)$; the planning horizon $h$; the state potential
function pot; the LTL formula $\phi$ over $\Pi$ (Eq. (1)); and the
preference function pref, find a control strategy $C$, such that

(i) the run generated by $C$ satisfies the mission $\phi$ and

(ii) assuming that $q = C(q_0 \ldots q_k)$, the cost function

$$\mathrm{pot}(q, q_0 \ldots q_k, h) + I((q_k, q)) \cdot \mathrm{pref}(q_0 \ldots q_k) \tag{2}$$

is maximized at each time $t_k$.

Intuitively, condition (ii) is interpreted as follows. At each
time, the aim is to go to the state with the best trade-off
between the amount of potentially collected rewards and the
importance of fast surveillance. The higher the value of the
preference function, the more likely a state closer to $\pi_{\mathrm{sur}}$ is
to be chosen. Note that, in general, the satisfaction of
condition (ii) may cause violation of objective (i). Our
goal is thus to provably guarantee accomplishment of the
mission and to maximize Eq. (2) if possible.
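
The trade-off expressed by Eq. (2) can be made concrete with a tiny sketch that picks the successor maximizing the cost function for a given preference value. The function `next_state` and the numbers are assumptions made here for illustration. The sketch covers condition (ii) only; on its own this greedy choice does not guarantee the mission.

```python
def next_state(q_k, successors, pot, indicator, pref_value):
    """Return the successor q maximizing pot(q) + I((q_k, q)) * pref."""
    # ties are broken by the order of `successors`
    return max(successors, key=lambda q: pot[q] + indicator[(q_k, q)] * pref_value)


# Assumed situation: b promises more reward, d leads towards a surveyed region.
pot = {"b": 7.0, "d": 6.0}
indicator = {("a", "b"): 0, ("a", "d"): 1}

low = next_state("a", ["b", "d"], pot, indicator, pref_value=0.5)   # 7.0 vs 6.5
high = next_state("a", ["b", "d"], pot, indicator, pref_value=2.0)  # 7.0 vs 8.0
```

With a low preference value the reward-rich successor wins, and once the preference value exceeds the difference of the potentials the shortening transition is chosen instead.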

Our approach leverages some ideas from the automata-based
solution from [14]. However, several issues have to
be overcome to support the user-defined trade-off, as
will become clear in the following section. The solution
consists of two consecutive steps. The first one is an
offline preparation before the deployment of the system. It
involves the construction of a BA for the given LTL mission
and its product with the TS. The offline algorithm assigns
two Boolean indicators to each transition of the product
automaton, which indicate whether the transition induces
progress towards a subgoal, i.e., towards a surveyed state of
the transition system, and towards both a surveyed state and
an accepting state of the product automaton, respectively.
In the second step, an online feedback algorithm, which
determines the next state to be visited by the robot, is
iteratively run. In each iteration, attractions of the states of
the product automaton are computed. The repeated choices of
the maximal attraction states lead to an eventual visit not only
to a surveyed state, but also to an accepting state of the
product automaton, assuming that the following holds:

Assumption 1: For each run $q_0 q_1 \ldots$ with the property
that $\exists n_1, \forall m > n_1 : \pi_{\mathrm{sur}} \notin L(q_m)$, it holds that
$\exists n_2, \forall m > n_2 : \mathrm{pref}(q_0 \ldots q_m) > \mathrm{pot}(q, q_0 \ldots q_m, h)$,
for all $q$, where $(q_m, q) \in T$.

As we will show in Sec. IV-C, the satisfaction of the LTL
mission is guaranteed provided that the above assumption
is true. From now on, we assume that Assump. 1 holds.
Intuitively, it says that if a visit to a surveyed state is
postponed for a long time, the value of the preference
function outweighs the value of the state potentials. Note
that this is, in fact, quite natural. It only captures the fact
that the user who defines the potential and the preference
function wishes to satisfy the LTL formula in the long term
and therefore her interest in making progress towards the
satisfaction of the formula at some point naturally prevails over
her interest in collecting the rewards. Several examples of
pot and pref functions that respect this assumption will be
shown in Section V.

IV. Solution 

In this section, we give the details of our solution to
Problem 1 and prove its correctness and completeness.
Discussions on the optimality of the solution are included,
too.

A. Offline Indicator Assignment

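Let $\mathcal{B}_\phi = (S, s_0, \Sigma, \delta, F)$ be a Büchi automaton
corresponding to the LTL formula $\phi = \varphi \land \mathsf{G}\,\mathsf{F}\,\pi_{\mathrm{sur}}$ (Eq. (1))
and $\mathcal{P} = \mathcal{T} \times \mathcal{B}_\phi = (S_{\mathcal{P}}, s_{\mathcal{P}0}, \delta_{\mathcal{P}}, F_{\mathcal{P}}, W_{\mathcal{P}})$ the product
automaton constructed according to Def. 4.

Let $S_{\mathcal{P}\pi} = \{(q, s) \in S_{\mathcal{P}} \mid \pi_{\mathrm{sur}} \in L(q)\}$ denote the subset
of states of $\mathcal{P}$ that project onto the surveyed states in $\mathcal{T}$.
Furthermore, let $F^{\infty}_{\mathcal{P}} \subseteq F_{\mathcal{P}}$ and $S^{\infty}_{\mathcal{P}\pi} \subseteq S_{\mathcal{P}\pi}$ be the sets
of states from which $S_{\mathcal{P}\pi}$ and $F_{\mathcal{P}}$ can be visited infinitely
many times, respectively. Sets $F^{\infty}_{\mathcal{P}}$ and $S^{\infty}_{\mathcal{P}\pi}$ can be computed
iteratively as the maximal sets of states from which a state in
$S^{\infty}_{\mathcal{P}\pi}$ and $F^{\infty}_{\mathcal{P}}$ is reachable via a finite run of nonzero length,
respectively (see Alg. 1, lines 2-9).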
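
The iterative computation of the two sets can be sketched as a fixed-point loop that repeatedly discards accepting states that cannot reach any remaining surveyed state, and vice versa. The function names, the successor-list encoding of the product, and the five-state example are assumptions made here for illustration; reachability is recomputed naively for clarity rather than efficiency.

```python
def reachable_nonzero(succ, p):
    """States reachable from p via a finite run with at least one transition."""
    seen, stack = set(), list(succ.get(p, ()))
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(succ.get(x, ()))
    return seen


def infinite_visit_sets(succ, accepting, surveyed):
    """Shrink both sets until neither changes (greatest fixed point)."""
    f_inf, s_inf = set(accepting), set(surveyed)
    changed = True
    while changed:
        changed = False
        for p in list(f_inf):
            if not (reachable_nonzero(succ, p) & s_inf):
                f_inf.discard(p)  # no remaining surveyed state reachable from p
                changed = True
        for p in list(s_inf):
            if not (reachable_nonzero(succ, p) & f_inf):
                s_inf.discard(p)  # no remaining accepting state reachable from p
                changed = True
    return f_inf, s_inf


# Assumed product graph: state 4 is accepting but trapped in a self-loop.
succ = {0: [1, 4], 1: [2], 2: [3], 3: [2], 4: [4]}
f_inf, s_inf = infinite_visit_sets(succ, accepting={1, 2, 4}, surveyed={0, 3})
```

In the assumed example state 4 is removed because no surveyed state is reachable from it, while states 0 and 1 stay in the sets even though they lie outside the cycle between 2 and 3.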

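Lemma 1: A run $\varrho_{\mathcal{P}}$ of $\mathcal{P}$ is accepting iff a state from $F^{\infty}_{\mathcal{P}}$
and a state from $S^{\infty}_{\mathcal{P}\pi}$ appear in $\varrho_{\mathcal{P}}$ infinitely many times.

Proof: Let $\varrho_{\mathcal{P}} = \varrho_{\mathcal{P}}(0) \varrho_{\mathcal{P}}(1) \ldots$ be an accepting run
of $\mathcal{P}$, i.e., a run with infinitely many visits to $F_{\mathcal{P}}$. Note that
there is a state in $S_{\mathcal{P}\pi}$ that appears in $\varrho_{\mathcal{P}}$ infinitely many
times, because $\varrho_{\mathcal{P}}$ satisfies $\phi$ and hence also $\mathsf{G}\,\mathsf{F}\,\pi_{\mathrm{sur}}$. Then
there exist infinite index sets $I, J \subseteq \mathbb{N}$, where $\varrho_{\mathcal{P}}(i) \in F_{\mathcal{P}}$,
$\varrho_{\mathcal{P}}(j) \in S_{\mathcal{P}\pi}$, for all $i \in I$ and $j \in J$. For each state
$\varrho_{\mathcal{P}}(i) \in F_{\mathcal{P}}$, $i \in I$, there exist infinitely many states $\varrho_{\mathcal{P}}(j) \in S_{\mathcal{P}\pi}$
where $i < j \in J$, and the analogous holds for each state
$\varrho_{\mathcal{P}}(j) \in S_{\mathcal{P}\pi}$, $j \in J$. Hence, all states $\varrho_{\mathcal{P}}(i)$, $\varrho_{\mathcal{P}}(j)$ where
$i \in I$, $j \in J$ belong to $F^{\infty}_{\mathcal{P}}$, $S^{\infty}_{\mathcal{P}\pi}$, respectively. On the other
hand, if a state from $F^{\infty}_{\mathcal{P}}$ occurs on $\varrho_{\mathcal{P}}$ infinitely many times,
then $\varrho_{\mathcal{P}}$ is clearly accepting. ■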

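Algorithm 1 Indicator assignment algorithm

Input: $\mathcal{P} = (S_{\mathcal{P}}, s_{\mathcal{P}0}, \delta_{\mathcal{P}}, F_{\mathcal{P}}, W_{\mathcal{P}})$
Output: $\mathcal{P} = (S_{\mathcal{P}}, s_{\mathcal{P}0}, \delta_{\mathcal{P}}, F_{\mathcal{P}}, W_{\mathcal{P}})$, $I_{\mathcal{P}\pi}$, $I_{\mathcal{P}\phi}$

 1: $F^{\infty}_{\mathcal{P}} := F_{\mathcal{P}}$, $S^{\infty}_{\mathcal{P}\pi} := S_{\mathcal{P}\pi}$
 2: while fix-point of $F^{\infty}_{\mathcal{P}}$, $S^{\infty}_{\mathcal{P}\pi}$ not found do
 3:     for all $p \in F^{\infty}_{\mathcal{P}}$, s.t. $\min_{(p,p') \in \delta_{\mathcal{P}},\, p'' \in S^{\infty}_{\mathcal{P}\pi}} W^*_{\mathcal{P}}(p', p'') = \infty$ do
 4:         remove $p$ from $F^{\infty}_{\mathcal{P}}$
 5:     end for
 6:     for all $p \in S^{\infty}_{\mathcal{P}\pi}$, s.t. $\min_{(p,p') \in \delta_{\mathcal{P}},\, p'' \in F^{\infty}_{\mathcal{P}}} W^*_{\mathcal{P}}(p', p'') = \infty$ do
 7:         remove $p$ from $S^{\infty}_{\mathcal{P}\pi}$
 8:     end for
 9: end while
10: for all $p \in S_{\mathcal{P}}$, s.t. $W^*_{\mathcal{P}\pi}(p) = \infty \lor W^*_{\mathcal{P}\phi}(p) = (\infty, \infty)$ do
11:     remove $p$ together with incident transitions
12: end for
13: for all $(p, p') \in \delta_{\mathcal{P}}$ do
14:     compute $I_{\mathcal{P}\pi}((p, p'))$, $I_{\mathcal{P}\phi}((p, p'))$ (Eq. (6))
15: end for

For each state $p \in S_{\mathcal{P}}$ we define the minimum weight of
a finite run from $p$ to a state from $S^{\infty}_{\mathcal{P}\pi}$

$$W^*_{\mathcal{P}\pi}(p) = \min_{p' \in S^{\infty}_{\mathcal{P}\pi}} W^*_{\mathcal{P}}(p, p') \tag{3}$$

and the minimum weight of a finite run from $p$ to $S^{\infty}_{\mathcal{P}\pi}$
containing a state $p' \in F^{\infty}_{\mathcal{P}}$

$$W^*_{\mathcal{P}F\pi}(p, p') = \min_{p'' \in S^{\infty}_{\mathcal{P}\pi}} \left( W^*_{\mathcal{P}}(p, p') + W^*_{\mathcal{P}}(p', p'') \right). \tag{4}$$

Moreover, we define

$$W^*_{\mathcal{P}\phi}(p) = \left( W^*_{\mathcal{P}F\pi}(p, p'),\; W^*_{\mathcal{P}}(p, p') \right), \tag{5}$$

where $p' \in F^{\infty}_{\mathcal{P}}$ minimizes $W^*_{\mathcal{P}}(p, p')$ among the set of
states that minimize Eq. (4). Given $W^*_{\mathcal{P}\phi}(p_1) = (u_1, v_1)$ and
$W^*_{\mathcal{P}\phi}(p_2) = (u_2, v_2)$, $W^*_{\mathcal{P}\phi}(p_1) < W^*_{\mathcal{P}\phi}(p_2)$ if and only if
$u_1 < u_2$ and $v_1 < v_2$.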

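Note that each state $p \in S_{\mathcal{P}}$ with $W^*_{\mathcal{P}\pi}(p) = \infty$ or
$W^*_{\mathcal{P}\phi}(p) = (\infty, \infty)$ cannot occur on any accepting run of $\mathcal{P}$.
Therefore, we assume from now on that $\mathcal{P}$ contains only
states $p \in S_{\mathcal{P}}$ with $W^*_{\mathcal{P}\pi}(p) \neq \infty$ and $W^*_{\mathcal{P}\phi}(p) \neq (\infty, \infty)$.

Lemma 2: $\forall p \in S_{\mathcal{P}} \setminus S^{\infty}_{\mathcal{P}\pi}, \exists p' \in S_{\mathcal{P}} : (p, p') \in \delta_{\mathcal{P}}$,
$W^*_{\mathcal{P}\pi}(p) > W^*_{\mathcal{P}\pi}(p')$, and $\forall p \in S_{\mathcal{P}} \setminus F^{\infty}_{\mathcal{P}}, \exists p' \in S_{\mathcal{P}} :
(p, p') \in \delta_{\mathcal{P}}$, $W^*_{\mathcal{P}\phi}(p) > W^*_{\mathcal{P}\phi}(p')$.

Proof: Follows directly from Eq. (3), (4) and (5). ■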

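We are now ready to define the shortening indicator
functions $I_{\mathcal{P}\pi}, I_{\mathcal{P}\phi} : \delta_{\mathcal{P}} \to \{1, 0\}$, which indicate whether a
transition induces progress towards the set $S^{\infty}_{\mathcal{P}\pi}$ and towards
both the set $F^{\infty}_{\mathcal{P}}$ and the set $S^{\infty}_{\mathcal{P}\pi}$ via a state in $F^{\infty}_{\mathcal{P}}$,
respectively.

$$I_{\mathcal{P}x}((p, p')) = \begin{cases} 1 & \text{if } W^*_{\mathcal{P}x}(p') < W^*_{\mathcal{P}x}(p), \\ 0 & \text{otherwise,} \end{cases} \tag{6}$$

where $x \in \{\pi, \phi\}$.

Corollary 1: $\forall p \in S_{\mathcal{P}} \setminus S^{\infty}_{\mathcal{P}\pi}, \exists (p, p') \in \delta_{\mathcal{P}}$, such that
$I_{\mathcal{P}\pi}((p, p')) = 1$ and $\forall p \in S_{\mathcal{P}} \setminus F^{\infty}_{\mathcal{P}}, \exists (p, p') \in \delta_{\mathcal{P}}$, such
that $I_{\mathcal{P}\phi}((p, p')) = 1$.

The outline of the indicator assignment procedure for the
product automaton $\mathcal{P}$ is summarized in Alg. 1.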

B. Online Planning 

The online planning algorithm is run at each $t_k$, such that
$q_0 \ldots q_k$ is the run prefix executed so far (i.e., till the current
time $t_k$), and it determines the next state $C(q_0 \ldots q_k)$ of $\mathcal{T}$ to
be visited. Simply put, we plan in the product automaton $\mathcal{P}$
and then we project the plan onto $\mathcal{T}$. Formally, $\mathcal{T}$ starts in
its initial state $q_0$ and $\mathcal{P}$ in its initial state $(q_0, s_0)$. For each
run prefix $(q_0, s_0) \ldots (q_k, s_k)$ of $\mathcal{P}$, the algorithm computes
the next state of $\mathcal{P}$, denoted by $C_{\mathcal{P}}((q_0, s_0) \ldots (q_k, s_k)) =
(q_{k+1}, s_{k+1})$. The next state of $\mathcal{T}$ is $C(q_0 \ldots q_k) = q_{k+1}$.

To guarantee that the control strategy $C$ generates a run
of $\mathcal{T}$ satisfying $\phi$, it is sufficient to ensure that the control
strategy $C_{\mathcal{P}}$ generates a run of $\mathcal{P}$ that visits $F_{\mathcal{P}}$ infinitely
many times. In $\mathcal{T}$, the high value of the preference function
pref was used to guide the robot towards $\pi_{\mathrm{sur}}$. Projected into
the product automaton, the high value of pref can "send" the
robot towards a state in $S^{\infty}_{\mathcal{P}\pi}$. We expand this idea and use
the preference function to guide the robot not only towards
$S^{\infty}_{\mathcal{P}\pi}$, but also towards $F^{\infty}_{\mathcal{P}}$. This way, we ensure that $F^{\infty}_{\mathcal{P}}$ is
indeed visited infinitely many times.

In particular, we introduce two subgoals in $\mathcal{P}$. The first
one is the mission subgoal, when a visit to $F^{\infty}_{\mathcal{P}}$ is targeted.
The second one is the surveillance subgoal, when we aim to
visit $S^{\infty}_{\mathcal{P}\pi}$. At each time, one of the subgoals is to be achieved
and once it is, the subgoals are switched and the other one is
to be achieved. Progress towards both subgoals is governed
by maximization of the attraction function $\mathrm{attr}_{\mathcal{P}}$, which we
define for the product automaton analogously to the
cost function (Eq. (2)) of Problem 1.

Consider the product $\mathcal{P}$ obtained after the execution of
the offline preparation algorithm (Alg. 1). Assume that $\phi$
is satisfiable, i.e., that $F^{\infty}_{\mathcal{P}}$ and $S^{\infty}_{\mathcal{P}\pi}$ are both nonempty
and $(q_0, s_0) \in S_{\mathcal{P}}$. The product $\mathcal{P}$ naturally inherits the
rewards from $\mathcal{T}$, i.e., $R_{\mathcal{P}}((q, s), (q_0, s_0) \ldots (q_k, s_k)) =
R(q, q_0 \ldots q_k)$. Thus, the value of the pot function can be
computed on the product automaton (or, more precisely, on
its underlying TS $\mathcal{T}_{\mathcal{P}}$) using $R_{\mathcal{P}}$. We use $\mathrm{pot}_{\mathcal{P}}(p, \varrho_{\mathrm{pfix}}, h)$
to denote the value of the state potential function for a state
$p$ computed on $\mathcal{P}$.

The value of the attraction $\mathrm{attr}_{\mathcal{P}} : S_{\mathcal{P}} \times S_{\mathcal{P}}^{+} \times \mathbb{N}_{>0} \to
\mathbb{R}_{\geq 0}$ is computed differently for the two subgoals. Initially,
the subgoal to be achieved is the surveillance one and the
attraction is

$\mathrm{attr}_{\mathcal{P}}(p, \rho_{pfix}, h) = \mathrm{pot}_{\mathcal{P}}(p, \rho_{pfix}, h) + I_{\mathcal{P}\pi}((p_k, p)) \cdot \mathrm{pref}(\alpha(\rho_{pfix})), \quad (7)$

where $\rho_{pfix} = p_0 \ldots p_k$ and $(p_k, p) \in \delta_{\mathcal{P}}$. For any run prefix
$p_0 \ldots p_k$, let $C_{\mathcal{P}}(p_0 \ldots p_k)$ be the state with the highest
attraction (if there are several, we choose one at random).
Hence, if the attraction of a state that is not closer to the
subgoal is higher than the attraction of the states that are, the
collection of rewards is preferred, and vice versa. Note,
however, that repeated choices of the states that maximize $\mathrm{attr}_{\mathcal{P}}$
together with Assump. 1 guarantee that the surveillance
subgoal, i.e., a visit to $S_{\mathcal{P}}^{\pi_{sur}}$, will eventually be achieved. Once
it is, the mission subgoal becomes the one to be reached.
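To make the selection rule concrete, the following Python sketch evaluates an attraction of the form of Eq. 7 for every successor of the last prefix state and picks a maximizer, breaking ties at random. It is an illustration only: the names `attraction`, `choose_next`, `pot`, `shortening`, and `pref` are hypothetical, the horizon and the projection are assumed to be folded into the user-supplied `pot` and `pref` callables, and the toy values are invented.

```python
import random
from typing import Callable, Hashable, Sequence

State = Hashable


def attraction(
    p: State,
    prefix: Sequence[State],
    pot: Callable[[State, Sequence[State]], float],
    shortening: Callable[[State, State], int],
    pref: Callable[[Sequence[State]], float],
) -> float:
    """Potential of p plus the preference, counted only on shortening transitions."""
    return pot(p, prefix) + shortening(prefix[-1], p) * pref(prefix)


def choose_next(successors, prefix, pot, shortening, pref, rng=random):
    """Return a successor with maximal attraction; ties are broken at random."""
    scored = [(attraction(p, prefix, pot, shortening, pref), p) for p in successors]
    best = max(score for score, _ in scored)
    return rng.choice([p for score, p in scored if score == best])


# Toy data (assumed): "a" holds a large reward, "b" is closer to the subgoal.
pot = lambda p, prefix: {"a": 10.0, "b": 2.0}[p]
shortening = lambda src, dst: 1 if dst == "b" else 0

# Low preference: reward collection wins. High preference: progress wins.
early = choose_next(["a", "b"], ["s"], pot, shortening, pref=lambda prefix: 3.0)
late = choose_next(["a", "b"], ["s"], pot, shortening, pref=lambda prefix: 20.0)
```

With the small preference value the reward-rich state is chosen, whereas the large preference value makes the shortening transition win, which mirrors the trade-off described above.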

For the mission subgoal, the attraction needs to be defined
in a different way. The reason is that with a definition
analogous to the one for the surveillance subgoal, we would not be
able to ensure an eventual visit to $F_{\mathcal{P}}^{\infty}$. Intuitively, if $\pi_{sur}$ were
repeatedly visited unintentionally, the value of $\mathrm{pref}(\alpha(\rho_{pfix}))$
might never outgrow the value of $\mathrm{pot}_{\mathcal{P}}(p, \rho_{pfix}, h)$, the
"non-shortening" transitions might always be chosen, and
a visit to $F_{\mathcal{P}}^{\infty}$ might be postponed indefinitely.

Thus, we define a projection function $\bar{\alpha}$ that projects
a run prefix $\rho_{pfix}$ of $\mathcal{P}$ onto the corresponding run of $\bar{\mathcal{T}}$
while removing $\pi_{sur}$ from some of the states. In particular,
on $\bar{\alpha}(\rho_{pfix})$, the proposition $\pi_{sur}$ appears at most once
between two successive visits to an accepting state in $F_{\mathcal{P}}^{\infty}$.



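Definition 5 (Projection $\bar{\alpha}$): Let $\bar{\mathcal{T}} = (\bar{Q}, q_0, \bar{T}, \Pi, \bar{L}, \bar{W})$
be a transition system, where $\bar{Q} = Q \cup \{\bar{q} \mid q \in Q\}$;
if $(q, q') \in T$, then $(q, q'), (\bar{q}, q'), (q, \bar{q}'), (\bar{q}, \bar{q}') \in \bar{T}$ and
$\bar{W}(q, q') = \bar{W}(\bar{q}, q') = \bar{W}(q, \bar{q}') = \bar{W}(\bar{q}, \bar{q}') = W(q, q')$;
and $\bar{L}(q) = L(q)$ and $\bar{L}(\bar{q}) = L(q) \setminus \{\pi_{sur}\}$ for all
$q \in Q$. Let $\rho_{pfix} = (q_0, s_0) \ldots (q_k, s_k)$ be a run prefix of $\mathcal{P}$.
Then $\bar{\alpha}(\rho_{pfix}) = \bar{\rho}_{pfix}$, where
$\bar{\rho}_{pfix}(0) = q_0$; $\bar{\rho}_{pfix}(i) = q_i$ if $\pi_{sur} \notin L(q_i)$, or if $\pi_{sur} \in L(q_i)$
and $\exists j < i$ such that $(q_j, s_j) \in F_{\mathcal{P}}^{\infty}$ and $\pi_{sur} \notin L(q_l)$ for
all $j < l < i$; and $\bar{\rho}_{pfix}(i) = \bar{q}_i$ otherwise.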
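As an illustration of Definition 5, the following Python sketch marks the positions of a run prefix at which the projection switches to the copy of the state without the surveillance proposition. It is a minimal sketch under assumptions: the prefix is given as a list of label sets together with a list of flags saying whether the corresponding product state is accepting, and the names `barred_positions`, `labels`, `accepting`, and the label `"sur"` are hypothetical.

```python
from typing import List, Sequence, Set


def barred_positions(
    labels: Sequence[Set[str]],
    accepting: Sequence[bool],
    pi_sur: str = "sur",
) -> List[bool]:
    """Flag the positions that the projection maps to the copy without pi_sur."""
    barred = []
    for i, label in enumerate(labels):
        if i == 0 or pi_sur not in label:
            # Position 0 is kept as is; states without pi_sur need no change.
            barred.append(False)
            continue
        # pi_sur is kept only if some accepting visit j < i has no
        # pi_sur-labeled state strictly between j and i.
        kept = any(
            accepting[j]
            and all(pi_sur not in labels[l] for l in range(j + 1, i))
            for j in range(i)
        )
        barred.append(not kept)
    return barred


# Toy prefix (assumed): accepting visits at positions 0 and 4.
labels = [set(), {"sur"}, set(), {"sur"}, set(), {"sur"}]
accepting = [True, False, False, False, True, False]
flags = barred_positions(labels, accepting)
projected = [lab - {"sur"} if bar else lab for lab, bar in zip(labels, flags)]
```

In the toy prefix, the second surveillance visit (position 3) is not preceded by a fresh accepting visit, so the proposition is removed there, while the visits at positions 1 and 5 keep it.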

The definition of the attraction for the mission subgoal is

$\mathrm{attr}_{\mathcal{P}}(p, \rho_{pfix}, h) = \mathrm{pot}_{\mathcal{P}}(p, \rho_{pfix}, h) + I_{\mathcal{P}\phi}((p_k, p)) \cdot \mathrm{pref}(\bar{\alpha}(\rho_{pfix})), \quad (8)$

where $\rho_{pfix} = p_0 \ldots p_k$ and $(p_k, p) \in \delta_{\mathcal{P}}$. As for
the surveillance subgoal, the state $C_{\mathcal{P}}(p_0 \ldots p_k)$ is the state
maximizing the attraction (if there are several, we choose
one at random). The construction of the attraction together
with Assump. 1 ensures that the mission subgoal is always
eventually reached. Once it is, we aim for the surveillance
subgoal again. If both subgoals are reached simultaneously,
the surveillance subgoal becomes the active one.
The outline of the solution to Problem 1 is given in Alg. 2.

Algorithm 2 Solution to Problem 1

Input: $\mathcal{T}$, $\pi_{sur}$, $v$, $R$, $h$, pot, $\phi$, pref
Output: Control strategy $C$

compute $\mathcal{P} = \mathcal{T} \times \mathcal{B}_{\phi}$ and run Alg. 1
if $F_{\mathcal{P}}^{\infty} = \emptyset$ or $(q_0, s_0) \notin S_{\mathcal{P}}$ then
    return "Mission cannot be accomplished"
end if
$\rho_{pfix} := (q_0, s_0)$, subgoal := $\pi_{sur}$, $k := 0$
while true do
    for all $p$ s.t. $(p_k, p) \in \delta_{\mathcal{P}}$ do
        compute $\mathrm{attr}_{\mathcal{P}}(p, \rho_{pfix}, h)$ (Eq. 7 if subgoal = $\pi_{sur}$
        and Eq. 8 if subgoal = $\phi$)
    end for
    $C_{\mathcal{P}}(\rho_{pfix}) := p$ maximizing $\mathrm{attr}_{\mathcal{P}}(p, \rho_{pfix}, h)$
    $C(\alpha(\rho_{pfix})) := \alpha(C_{\mathcal{P}}(\rho_{pfix}))$
    if subgoal = $\pi_{sur}$ and $C_{\mathcal{P}}(\rho_{pfix}) \in S_{\mathcal{P}}^{\pi_{sur}}$ then
        subgoal := $\phi$
    end if
    if subgoal = $\phi$ and $C_{\mathcal{P}}(\rho_{pfix}) \in F_{\mathcal{P}}^{\infty}$ then
        subgoal := $\pi_{sur}$
    end if
    concatenate $C_{\mathcal{P}}(\rho_{pfix})$ to $\rho_{pfix}$; $k := k + 1$
end while
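The online loop of Alg. 2 can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the experiments: the infinite loop is cut after a fixed number of `steps`, the attraction is passed in as a hypothetical callable `attr(subgoal, p, prefix)`, and the product automaton is represented by an assumed successor map `succ` and two state sets `sur_states` and `acc_states`.

```python
import random
from typing import Callable, Dict, Hashable, List, Set, Tuple

State = Hashable


def online_plan(
    init: State,
    succ: Dict[State, List[State]],
    sur_states: Set[State],
    acc_states: Set[State],
    attr: Callable[[str, State, List[State]], float],
    steps: int,
    rng: random.Random,
) -> Tuple[List[State], List[str]]:
    """Run `steps` iterations of the subgoal-switching loop."""
    prefix = [init]
    subgoal = "sur"  # the surveillance subgoal is active first
    history = []
    for _ in range(steps):
        candidates = succ[prefix[-1]]
        scores = {p: attr(subgoal, p, prefix) for p in candidates}
        best = max(scores.values())
        nxt = rng.choice([p for p in candidates if scores[p] == best])
        # Same order of the two checks as in Alg. 2: if both subgoals are
        # reached at once, the surveillance subgoal ends up active.
        if subgoal == "sur" and nxt in sur_states:
            subgoal = "mission"
        if subgoal == "mission" and nxt in acc_states:
            subgoal = "sur"
        prefix.append(nxt)
        history.append(subgoal)
    return prefix, history


# Toy product (assumed): a ring of three states with self-loops.
succ = {0: [0, 1], 1: [1, 2], 2: [2, 0]}
always_move = lambda subgoal, p, prefix: 1.0 if p != prefix[-1] else 0.0
run, subgoals = online_plan(0, succ, {1}, {2}, always_move, 4, random.Random(0))
```

In the toy run the planner reaches the surveillance state 1, switches to the mission subgoal, reaches the accepting state 2, and switches back.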



C. Discussion 

In this section, we prove that under Assump. 1 our
algorithm is correct and complete with respect to the
satisfaction of the LTL formula (condition (i) of Problem 1). We
discuss the sub-optimality of the solution and we introduce
an assumption under which the local plan is optimal with
respect to condition (ii) of Problem 1 among the solutions
that do not cause an immediate, irreparable violation of $\phi$.

Theorem 1 (Correctness and Completeness): Alg. 2
returns a strategy $C$ that generates a run of $\mathcal{T}$ satisfying $\phi$
if and only if such a strategy exists.



$\mathrm{pot}^1(q, q_0 \ldots q_k, h) = \max_{\rho(0) \ldots \rho(n) \in P_{fin}(q, q_k, h)} \sum_{i=0}^{n} f_1(\rho(i), q_0 \ldots q_k, \rho(0) \ldots \rho(n))$, and

$\mathrm{pot}^2(q, q_0 \ldots q_k, h) = \max_{\rho(0) \ldots \rho(n) \in P_{fin}(q, q_k, h)} \left( \max_{i = 0, \ldots, n} f_2(\rho(i), q_0 \ldots q_k, \rho(0) \ldots \rho(n)) \right)$,

where $f_{1,2}(\rho(i), q_0 \ldots q_k, \rho(0) \ldots \rho(n)) = R(\rho(i), q_0 \ldots q_k) - W(\rho(0) \ldots \rho(i))$ if this value is $> 0$, $\rho(i) \neq q_k$, and $\rho(j) \neq \rho(i)$ for all $j < i$,

and $f_1 = 15$, $f_2 = 0$ otherwise.

$\mathrm{pref}^1(q_0 \ldots q_k) = 0$ if $W(q_{i_\pi} \ldots q_k) < 50$, and $\mathrm{pot}'(q_0 \ldots q_k, h) + 1$ otherwise,

$\mathrm{pref}^2(q_0 \ldots q_k) = \frac{1}{50^3} \cdot W(q_{i_\pi} \ldots q_k)^3 \cdot \mathrm{pot}'(q_0 \ldots q_k, h)$, and $\mathrm{pref}^3(q_0 \ldots q_k) = \frac{1}{\sqrt{50}} \cdot \sqrt{W(q_{i_\pi} \ldots q_k)} \cdot \mathrm{pot}'(q_0 \ldots q_k, h)$,

where $i_\pi$ is the maximal $0 \leq i \leq k$ such that $\pi_{sur} \in L(q_{i_\pi})$, and $\mathrm{pot}'(q_0 \ldots q_k, h)$ is the maximal $\mathrm{pot}(q, q_0 \ldots q_k, h)$ among all $q$ where $(q_k, q) \in T$.

TABLE I: Definitions of the state potential and the preference functions used in the case study.



Proof: (Sketch.) Assume that Alg. 2 returns "Mission
cannot be accomplished". Then $F_{\mathcal{P}}^{\infty}$ is empty and, according
to Lemma 1, $\phi$ cannot be satisfied in $\mathcal{T}$. Assume that
Alg. 2 computes a strategy $C_{\mathcal{P}}$ for the product $\mathcal{P}$. We will
show by contradiction that $C_{\mathcal{P}}$ generates a run $\rho$ of $\mathcal{P}$
visiting $F_{\mathcal{P}}^{\infty}$ infinitely many times. Assume that there is a
finite prefix $\rho_{pfix} = p_0 \ldots p_k$ of $\rho$ such that $p_n \notin F_{\mathcal{P}}^{\infty}$
for all $n > k$. First, assume that the current subgoal
is the surveillance one. Then, according to Assump. 1
and the definition of the attraction function,
$\mathrm{pref}(\alpha(\rho'_{pfix})) > \mathrm{pot}_{\mathcal{P}}(p, \rho'_{pfix}, h)$ for all prefixes $\rho'_{pfix} =
p_0 \ldots p_k \ldots p_l$ of the run $\rho$ such that $l > m$, for some
$m > k$. This means that the "shortening" transitions will be
preferred over the "non-shortening" ones from $p_m$ on and thus
some $p_j \in S_{\mathcal{P}}^{\pi_{sur}}$ will be reached eventually. Second, assume that
the mission subgoal is the current one. Then, according to
Assump. 1 and the definition of the attraction function,
$\mathrm{pref}(\bar{\alpha}(\rho'_{pfix})) > \mathrm{pot}_{\mathcal{P}}(p, \rho'_{pfix}, h)$ for all prefixes
$\rho'_{pfix} = p_0 \ldots p_k \ldots p_l$ of the run $\rho$ such that $l > m$, for
some $m > k$. As in the previous case, some $p_j \in F_{\mathcal{P}}^{\infty}$
will be reached eventually, which contradicts the assumption.
Thus the proof is complete. ■

In general, the satisfaction of condition (ii) of Problem 1
cannot be guaranteed, as repeated visits to the state
maximizing Eq. 2 might prevent the mission from being satisfied. However,
we reach some level of optimality, as discussed below.

In the attraction definition (Eq. 7), the value of the state
potential function $\mathrm{pot}_{\mathcal{P}}(p, p_0 \ldots p_k, h)$ is computed in the
product automaton instead of the transition system. As a
result, it is computed assuming that only sequences of
transitions that do not cause an immediate, irreparable
violation of the formula can be followed from $q$. If the
current subgoal of the online planner is the surveillance
subgoal, the following optimality statement can be made:
A state of $\mathcal{P}$ maximizing the attraction (Eq. 7) projects
onto the state of $\mathcal{T}$ maximizing the cost function (Eq. 2),
taking into consideration only finite runs that do not cause
an immediate violation of the formula. In contrast, if the
current subgoal of the online planner is the mission one, we
cannot make a similar claim. First, the indicator function in the
attraction (Eq. 8) does not indicate whether a transition of
the product automaton leads closer to $\pi_{sur}$; rather, it indicates
whether it leads closer to both an accepting state and $\pi_{sur}$.
Second, the preference function in the attraction function
(Eq. 8) is computed for $\bar{\alpha}(p_0 \ldots p_k)$ instead of $\alpha(p_0 \ldots p_k)$.
This is necessary for the correctness of the algorithm; however,
as a result, the value of $\mathrm{pref}(\bar{\alpha}(p_0 \ldots p_k))$ in the attraction
(Eq. 8) might differ from the corresponding value of
$\mathrm{pref}(q_0 \ldots q_k)$ in the cost function (Eq. 2).

In case $F_{\mathcal{P}}^{\infty} = \{q' \in S_{\mathcal{P}} \mid q \in S_{\mathcal{P}}^{\pi_{sur}} \text{ and } (q, q') \in \delta_{\mathcal{P}}\}$, the
mission subgoal is always reached exactly one planning step
after the surveillance subgoal is reached. Therefore, the
optimality stated in the previous paragraph
for the surveillance subgoal also holds for the mission subgoal,
since all the transitions from $S_{\mathcal{P}}^{\pi_{sur}}$ are always "shortening"
with respect to $F_{\mathcal{P}}^{\infty}$. In particular, this is the case if a Büchi
automaton with the property that all the transitions leading
to an accepting state are labeled with a set containing $\pi_{sur}$
is used in the product automaton construction. For instance,
the surveillance fragment of LTL defined in [17] guarantees
the existence of such a BA. The fragment includes LTL formulas
that require repeatedly visiting a surveillance proposition $\pi_{sur}$
(called an optimizing proposition in [17]) and visiting a given
set of regions in between any two successive visits to states
satisfying $\pi_{sur}$. In addition, ordering constraints,
request-response properties, and safety properties are allowed.

Complexity: The size of a BA for an LTL formula $\phi$ is
$2^{O(|\phi|)}$ in the worst case, where $|\phi|$ denotes the length of
the formula $\phi$ [15]. However, note that the actual size of the
BA is in practice often quite small. The size of the product
automaton $\mathcal{P}$ is $O(|Q| \cdot 2^{O(|\phi|)})$. A simple modification of the
Floyd-Warshall algorithm is employed to find the minimum
weights between each pair of states in $O(|S_{\mathcal{P}}|^3)$. The same
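complexity is reached for the computation of $S_{\mathcal{P}}^{\pi_{sur}}$, $W_{\mathcal{P}\phi}$,
and $W_{\mathcal{P}\pi}$. The shortening indicators $I_{\mathcal{P}\phi}$, $I_{\mathcal{P}\pi}$ can be
computed in linear time and space with respect to the size of $\mathcal{P}$.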
The overall complexity of Alg. 1 is $O((|Q| \cdot 2^{O(|\phi|)})^3)$. The
complexity of the online planning algorithm highly depends
on the complexity of the state potential and the preference
functions. The set $P_{fin}(q, q_k, h)$ can be computed in $O(d^h)$,
where $d$ denotes the maximal out-degree of states of $\mathcal{P}$. If
the pot and pref functions took constant time to compute, the
online planning algorithm would be in $O(d \cdot d^h)$ per iteration.
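For reference, the cubic bound quoted above is that of the standard Floyd-Warshall scheme, sketched below in Python. The sketch is not necessarily the modification used in the offline algorithm; as an assumption made here for illustration, the diagonal is left at infinity, so that `dist[(u, u)]` is the minimum weight of a nonempty cycle through `u`, which is one way to detect states lying on a cycle. Weights are assumed to be non-negative, and the names `min_weights`, `states`, and `weight` are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Tuple

State = Hashable


def min_weights(
    states: List[State],
    weight: Dict[Tuple[State, State], float],
) -> Dict[Tuple[State, State], float]:
    """Minimum weight of a nonempty path between every ordered pair of states."""
    dist = {(u, v): math.inf for u in states for v in states}
    for (u, v), w in weight.items():
        dist[(u, v)] = min(dist[(u, v)], w)
    # Three nested loops over the states give the cubic running time.
    for m in states:
        for u in states:
            for v in states:
                via = dist[(u, m)] + dist[(m, v)]
                if via < dist[(u, v)]:
                    dist[(u, v)] = via
    return dist


# Toy weighted graph (assumed): a 3-cycle with one extra, more expensive edge.
dist = min_weights([0, 1, 2], {(0, 1): 2, (1, 2): 3, (2, 0): 2, (0, 2): 7})
```

In the toy graph the direct edge from 0 to 2 (weight 7) is beaten by the two-step path (weight 5), and every state lies on the cycle of weight 7.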

V. Example 

We implemented the framework with several concrete 
choices of the state potential and the preference function in 
a Java applet [18]. In this section, we report on simulation
results that illustrate the use of our approach.

We consider a data gathering robot in a grid-like
partitioned environment modeled as a TS depicted in Fig. 1.
The robot collects data packages of various, changing sizes




Fig. 1: A transition system representing the robot (illustrated as the 
black dot) motion model in a partitioned environment. Individual 
regions are depicted as nodes (states). Transmitters are in green 
(labeled with propositions a and b, respectively), unsafe locations 
(labeled with u) are in red. The set of transitions contains every pair 
(q, q') of vertically, horizontally or diagonally neighboring states. 
Weights of a horizontal and a vertical transition are 2, weight of a 
diagonal transition is 3. 



(rewards) in the visited regions. The following is known
about the reward dynamics: A non-negative integer reward
can appear in a state with the current reward equal to 0.
The probability of the fresh reward being from $\{0, \ldots, 15\}$
is 50%, as is the probability of it being from $\{16, \ldots, 60\}$ (i.e., the
smaller-sized data packages are more likely to occur). The reward drops
by 1 every time unit as the data become outdated. The visibility range
$v$ is 6. For example, in Fig. 1, the visibility region $V(q_0)$ for
the current state $q_0$ is depicted as the blue-shaded area.
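The reward dynamics described above can be simulated along the following lines. The Python sketch makes two assumptions that go beyond the text: a fresh reward appears as soon as the current reward is 0, and it is drawn uniformly within the selected range (so each value in $\{0, \ldots, 15\}$ has probability $0.5/16$ and each value in $\{16, \ldots, 60\}$ has probability $0.5/45$). The names `fresh_reward` and `step_rewards` are hypothetical.

```python
import random
from typing import Dict, Hashable


def fresh_reward(rng: random.Random) -> int:
    """Draw a new reward: 50% from {0..15}, 50% from {16..60} (uniform inside)."""
    if rng.random() < 0.5:
        return rng.randint(0, 15)
    return rng.randint(16, 60)


def step_rewards(
    rewards: Dict[Hashable, int],
    rng: random.Random,
    elapsed: int = 1,
) -> Dict[Hashable, int]:
    """Advance the rewards by `elapsed` time units."""
    updated = {}
    for state, value in rewards.items():
        if value == 0:
            # Assumption: an empty state receives a fresh reward immediately.
            updated[state] = fresh_reward(rng)
        else:
            # The reward drops by 1 per time unit and never goes below 0.
            updated[state] = max(0, value - elapsed)
    return updated


rng = random.Random(0)
samples = [fresh_reward(rng) for _ in range(1000)]
after_one = step_rewards({"q1": 5, "q2": 1}, rng)
after_three = step_rewards({"q1": 5, "q2": 1}, rng, elapsed=3)
```

The `elapsed` parameter accounts for transitions of weight 2 or 3, during which the visible rewards decay by the corresponding number of time units.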

The mission assigned to the robot is to alternately visit
the two transmitters (in green, labeled with propositions $a$
and $b$, respectively), while avoiding unsafe locations (in red,
labeled with $u$). The surveillance proposition $\pi_{sur}$ is true in
both transmitter regions. The LTL formula for the mission is

$\phi = \mathbf{G}(a \Rightarrow \mathbf{X}(\neg a \, \mathbf{U} \, b)) \wedge \mathbf{G}(b \Rightarrow \mathbf{X}(\neg b \, \mathbf{U} \, a)) \wedge \mathbf{G}(\neg u) \wedge \mathbf{G}\,\mathbf{F}\, \pi_{sur}.$
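To illustrate what the formula forbids, the following Python sketch scans a finite sequence of label sets and reports whether it already violates the safety part of the mission: entering an unsafe region, or seeing the same transmitter again before the other one. It is only a finite-prefix check written for this example, not a general LTL monitor; the liveness conjunct with $\pi_{sur}$ cannot be refuted on a finite prefix and is therefore not checked. The name `violates_safety` and the trace encoding are assumptions.

```python
from typing import Sequence, Set


def violates_safety(trace: Sequence[Set[str]]) -> bool:
    """True if the finite trace breaks the alternation or safety conjuncts."""
    wait_b = False  # a was seen: b must occur before a occurs again
    wait_a = False  # b was seen: a must occur before b occurs again
    for label in trace:
        if "u" in label:
            return True  # an unsafe region was entered
        if wait_b:
            if "b" in label:
                wait_b = False
            elif "a" in label:
                return True  # a repeated before b
        if wait_a:
            if "a" in label:
                wait_a = False
            elif "b" in label:
                return True  # b repeated before a
        # Obligations created at this position apply from the next one on.
        if "a" in label:
            wait_b = True
        if "b" in label:
            wait_a = True
    return False


good = [set(), {"a"}, set(), {"b"}, set(), {"a"}]
repeated_a = [{"a"}, set(), {"a"}]
unsafe = [set(), {"u"}]
```

Note that, because of the next operator, staying in a transmitter region for two consecutive steps also counts as a repeated visit in this check.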

In our simulations, we consider the planning horizon $h = 9$
and several variants of the state potential function and the
preference function, which are summarized in Table I. The first
state potential function $\mathrm{pot}^1$ is the maximal sum of rewards
that can be collected on a finite run while taking into account
the reward behavior assumptions described above. If the run
visits a state more than once or a reward of a state drops
below 0, we assume the reward there is 15. The second state
potential function $\mathrm{pot}^2$ is defined as the maximal size of a
single data package that can be collected on a finite run.

The ratio of the value of pref to the maximum
value of pot is always non-decreasing in the time elapsed
since the last transmission, and the value of pref outgrows
the maximum value of pot when the elapsed time is 50.
Intuitively, $\mathrm{pref}^1$ sets zero importance on going towards a
transmitter if the last transmission occurred no more than 50
time units ago. On the other hand, $\mathrm{pref}^2$ rises quite slowly at
the beginning and very quickly later. In contrast, the function
$\mathrm{pref}^3$ grows very fast in the beginning and its growth then slows
down.
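The three growth profiles can be compared with a short Python sketch. It follows our reading of the definitions in Table I, with the time elapsed since the last transmission passed as `elapsed` and the maximal potential among the successors passed as `pot_max`; the function names, the argument names, and the constant `THRESHOLD` are hypothetical.

```python
import math

THRESHOLD = 50.0  # elapsed time at which the preference catches up with pot


def pref1(elapsed: float, pot_max: float) -> float:
    """Step profile: zero before the threshold, just above pot_max afterwards."""
    return 0.0 if elapsed < THRESHOLD else pot_max + 1.0


def pref2(elapsed: float, pot_max: float) -> float:
    """Cubic profile: slow at the beginning, very fast later."""
    return (elapsed / THRESHOLD) ** 3 * pot_max


def pref3(elapsed: float, pot_max: float) -> float:
    """Square-root profile: fast at the beginning, slowing down later."""
    return math.sqrt(elapsed / THRESHOLD) * pot_max


# Values halfway to the threshold for an assumed maximal potential of 40.
halfway = (pref1(25.0, 40.0), pref2(25.0, 40.0), pref3(25.0, 40.0))
```

Halfway to the threshold, the step profile is still 0, the cubic profile has reached one eighth of the maximal potential, and the square-root profile is already at about 71% of it; at an elapsed time of 50, the cubic and square-root profiles equal the maximal potential and the step profile exceeds it by 1.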
Fig. 2: Total size of data collected since the last transmission with
respect to time depicted for each executed run.

TABLE II: Statistical results for the reward per transition (r/T) and
the time between consecutive surveys (t) for different choices of
pot/pref functions (in the header). AVG is the mean of the averages
computed on each run and VAR is the mean of the variances computed on
each run. v shows the percentage variance of the average among
the runs.



For each of the 6 instances, we executed 5 runs of 100
iterations of the online planner. The sizes of the data
collected over time are depicted in Fig. 2. Table II shows the
mean of the average reward per transition and of the time between
consecutive surveys, respectively. As expected, the faster the
preference function grows with time since the last survey,
the smaller the reward per transition and the shorter the
time between consecutive transmissions. For $\mathrm{pref}^1$ and
$\mathrm{pref}^2$, the difference in the reward per transition is not high,
since in both cases the collection of rewards is preferred
in the beginning, whereas $\mathrm{pref}^3$ is very steep and therefore
drives the robot towards a transmitter quickly. The function $\mathrm{pot}^1$,
which computes the maximal sum of rewards that can be collected,
gives, as expected, a higher average and a lower variance for
both objectives compared to $\mathrm{pot}^2$, which aims to collect big
packages.

The experiments were run on Mac OS X 10.7.3 with a
2.7 GHz Intel Core i5 and 4 GB DDR3 memory. The BA
had 8 states (3 accepting) and it satisfied the condition for
optimality from Sec. IV-C. The product automaton had 800
states. The offline part of Alg. 2 took 6 seconds and one
iteration of the online planning algorithm took 1-2 milliseconds.

VI. Conclusions and Future Work 

We proposed a general framework for robot motion
planning in environments with dynamically changing rewards.
While a high-level surveillance mission is guaranteed to
be accomplished, the user-defined priorities on the trade-off
between the surveillance frequency and the reward collection
are taken into account. The motion of the robot is modeled
as a weighted transition system. Although the weights are
interpreted in this paper as time durations of the transitions,
they can, in general, be interpreted as any quantitative aspect,
such as length or cost. In future work, we would like to
extend the suggested framework to systems that are
modeled as Markov decision processes and to achieve solution
optimality for special subclasses of the reward dynamics. We
also plan to extend the implementation of the framework.